Surprises in approximating Levenshtein distances 
Abstract

The Levenshtein distance is an important tool for the comparison of symbolic 
sequences, with many appearances in genome research, linguistics and other areas. 
For efficient applications, an approximation by a distance of smaller computational 
complexity is highly desirable. However, our comparison of the Levenshtein with 
a generic dictionary-based distance indicates their statistical independence. This 
suggests that a simplification along this line might not be possible without restricting 
the class of sequences. Several other probabilistic properties are briefly discussed, 
emphasizing various questions that deserve further investigation. 
1 Introduction



The Levenshtein (or edit) metric (Levenshtein, 1965) is a standard tool to estimate the
distance between two sequences. It is widely used in linguistics and bioinformatics, and
for the recognition of text blocks with isolated mistakes. As is well known, its computational
complexity, when applied to two sequences of (approximately) the same length n, is
$O(n^2)$. Since this is a hurdle in many practical applications, it is desirable to replace, or
to approximate, the Levenshtein (L) distance by some quantity of smaller (preferably linear)
computational complexity. Two fast approximation algorithms for edit distances were
suggested by Ukkonen (1992), one based on maximal exact matches, the other on suitably
restricted subword comparisons between the two sequences; compare also Lippert et al.
(2002). This would indeed give $O(n)$, due to their computability from the suffix tree; see
Gusfield (1999). However, they only provide lower bounds, and hence no complete solution
of the problem.

It seems possible to estimate probabilistically, with sublinear complexity, whether the
L-distance of two sequences is 'small' or 'large'; see Batu et al. (2003). Whether an
improvement of this rather coarse result or even a replacement of the L-distance is possible,
with at most linear complexity and a non-probabilistic outcome, seems open. Below, we
compare the L-distance with a representative dictionary-based distance. Our findings sup- 
port the conclusion that such a simplification might be difficult or even impossible. On the 
way, we highlight some interesting properties that have been neglected so far, but seem 
relevant for a better understanding of such distance concepts. 



2 Comparison of two distances 

To keep discussion and results transparent, we concentrate on two specific distances, and 
on binary sequences. We have also tried a number of obvious alternatives, but they did 
not show any significantly different behaviour. In this sense, the structure of our example 
is more likely typical than exceptional. 

The L-distance $d_L(u, v)$ of two sequences u and v (not necessarily of equal length) is
the minimum number of edit operations (insertions, deletions, or substitutions) needed to
transform u into v or vice versa (Gusfield, 1999, Ch. 11.2). Though $d_L(u, v)$ is closely related
to the longest common subsequence (LCS) (loc. cit., Ch. 11.6.2) of u and v (and hence to
distances based upon it), one important difference lies in the possibility of substitutions.
So, using the LCS in this context requires some care. For sequences of lengths m and n,
the computational complexity of calculating $d_L$ (or the LCS) is $O(mn)$, e.g., when based
on the Needleman-Wunsch algorithm; see (Ewens and Grant, 2004, Ch. 6.4.2).
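The quadratic cost is easy to see in the standard dynamic-programming recursion. The following is a minimal sketch, assuming unit costs for all three edit operations; the function name `levenshtein` is chosen here for illustration and is not taken from the references above.

```python
def levenshtein(u: str, v: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(u) < len(v):
        u, v = v, u  # the distance is symmetric; keep the shorter sequence in the inner loop
    previous = list(range(len(v) + 1))  # distances from the empty prefix of u
    for i, a in enumerate(u, start=1):
        current = [i]  # turning u[:i] into the empty sequence costs i deletions
        for j, b in enumerate(v, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

The two nested loops visit every pair of positions once, which gives the O(mn) running time; only two rows of the table are kept, so the memory requirement is linear in the length of the shorter sequence.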



A generic choice for a dictionary-based metric is

$$d_D(u, v) = \mathrm{card}\bigl(A(u) \,\triangle\, A(v)\bigr),$$

where A(u) is the full dictionary of u, i.e., the set of all non-empty subwords of u, and
$A \triangle B = (A \cup B) \setminus (A \cap B)$ is the symmetric difference of A and B. This choice
actually disregards the goal of computational simplification, but focuses on the full dictionary
information instead, and thus, in some sense, represents the optimal information on the
sequences to be compared. It is well known that, using the suffix tree structure, the calculation
of closely related dictionary-based distances is possible with linear complexity, e.g.,
by means of Ukkonen's algorithm; compare (Gusfield, 1999, Ch. 6). On the other hand,
further restrictions are likely to reduce the usefulness in relation to the L-distance.
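For short sequences, the definition can be evaluated directly. The sketch below is a brute-force illustration that enumerates all subwords explicitly; it is not the suffix-tree approach mentioned above, and the function names `dictionary` and `dictionary_distance` are assumptions made for this example.

```python
def dictionary(u: str) -> set:
    """Full dictionary A(u): the set of all non-empty subwords (substrings) of u."""
    return {u[i:j] for i in range(len(u)) for j in range(i + 1, len(u) + 1)}


def dictionary_distance(u: str, v: str) -> int:
    """d_D(u, v): cardinality of the symmetric difference of the two dictionaries."""
    return len(dictionary(u) ^ dictionary(v))


# The sequences 0000 and 0010 differ by a single substitution,
# yet their dictionaries share only the subwords 0 and 00.
print(dictionary_distance("0000", "0010"))
```

A sequence of length n has up to n(n+1)/2 distinct subwords, so this direct evaluation is only practical for short sequences.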

Both $d_L$ and $d_D$ define a metric, i.e., for arbitrary sequences u, v and w, the distance
$d \in \{d_L, d_D\}$ satisfies the axioms of a metric (Schechter, 1997, Ch. 2.11):

(i) $0 \le d(u, v) < \infty$ (positivity);

(ii) $d(u, v) = 0$ if and only if $u = v$ (non-degeneracy);

(iii) $d(u, v) = d(v, u)$ (symmetry);

(iv) $d(u, v) \le d(u, w) + d(w, v)$ (triangle inequality).

Figure 1: Simulated probability distribution $P_L$ of the L-distance $d_L$ between two random
sequences of length 500 (dots) and Gaussian approximation (line), with mean 150.84 and
variance 24.82.



Less clear is the relation between $d_L$ and $d_D$. Since one can easily construct pairs
of sequences that are close in one, but not in the other distance, they are certainly not
equivalent in the strong sense as also used for norms, compare (Werner, 1995, Ch. 1.2).
They are equivalent in the weaker sense of generating the same topology (Schechter, 1997,
Ch. 22.5), which is the discrete topology here. However, this is of little use for the question
addressed above. The situation does not improve if one replaces $d(u, v)$ by the quotient
$d(u, v)/(1 + d(u, v))$, which is another metric, with range in [0, 1]. As we shall see below,
the situation is actually much worse.
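That the quotient is again a metric follows from a short standard argument, sketched here for convenience. The auxiliary function f is introduced only for this illustration; positivity, non-degeneracy and symmetry carry over directly, so only the triangle inequality needs a line of calculation.

```latex
\begin{align*}
  f(t) &= \frac{t}{1+t} = 1 - \frac{1}{1+t}
       && \text{is increasing on } [0,\infty), \\
  f(a+b) &= \frac{a}{1+a+b} + \frac{b}{1+a+b} \;\le\; f(a) + f(b)
       && \text{for } a, b \ge 0, \\
  f\bigl(d(u,v)\bigr) &\le f\bigl(d(u,w) + d(w,v)\bigr)
       \;\le\; f\bigl(d(u,w)\bigr) + f\bigl(d(w,v)\bigr) .
\end{align*}
```

The first inequality in the last line uses monotonicity together with the triangle inequality for d, the second uses the subadditivity of f.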



3 Concrete results 

To get a first impression of the L-distance, we computed the discrete probability distribution
of the values $d_L(u, v)$ for sequences $u \neq v$ of the same length, under uniform distribution
on sequence space. This has long been known to be a reasonable first approach for the
comparison of sequences from data bases (Reich et al., 1984). Up to length 20, this was
done using all possible pairs; for longer sequences, the distribution was estimated from a
sufficiently large random selection of pairs. For length n = 500, the result obtained from
$4 \times 10^5$ pairs is shown in Figure 1. For large n, the distributions seem to be well described
by Gaussian (or normal) distributions. This qualitative behaviour does not change much 
and seems to improve with sequence length. One could add weight to this finding by performing
a statistical test on Gaussianity, which would score well. However, we think that
one should not over-interpret this observation, in particular in view of a recent numerical
investigation by Pang et al. (2005) which indicates that a gamma distribution might give
an even better description.

Figure 2: Mean $M_L(n)$ of the probability distribution $P_L$ as a function of sequence length
n, calculated exactly for $n \le 20$ and by simulation otherwise. The solid line shows the
least squares fit $M_L(n) = 0.413\,\sqrt{n} + 0.283\,n$.
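The two ways of obtaining the distribution described above, full enumeration for short sequences and random sampling for longer ones, can be sketched as follows. This is a small-scale illustration under stated assumptions, not the code used for the figures: the function names, the seed, and the sizes (n = 4 for enumeration; n = 30 with 500 pairs for sampling) are chosen here so that the example runs quickly.

```python
import itertools
import random
from statistics import fmean, pvariance


def levenshtein(u, v):
    """Edit distance with unit costs, O(len(u) * len(v)) time."""
    previous = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        current = [i]
        for j, b in enumerate(v, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (a != b)))  # substitution or match
        previous = current
    return previous[-1]


def exact_l_distances(n):
    """d_L for every ordered pair u != v of binary sequences of length n (small n only)."""
    words = ["".join(bits) for bits in itertools.product("01", repeat=n)]
    return [levenshtein(u, v) for u in words for v in words if u != v]


def sampled_l_distances(n, pairs, seed=0):
    """d_L for `pairs` uniformly random pairs u != v of binary sequences of length n."""
    rng = random.Random(seed)
    values = []
    while len(values) < pairs:
        u = "".join(rng.choice("01") for _ in range(n))
        v = "".join(rng.choice("01") for _ in range(n))
        if u != v:  # pairs of identical sequences are excluded
            values.append(levenshtein(u, v))
    return values


for label, values in [("exact, n = 4", exact_l_distances(4)),
                      ("sampled, n = 30", sampled_l_distances(30, 500))]:
    print(label, "mean:", fmean(values), "variance:", pvariance(values))
```

The naive enumeration grows like 4^n pairs, so it is only meant for very small n here; the sampling routine carries over unchanged to longer sequences, at a cost quadratic in n per pair.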

Note that, if extrema over local alignments are taken, one obtains an extremal value
distribution (Pearson and Wood, 2001, Ch. 2.3.2). However, this implies nothing for the
global alignment considered here. The possible (or approximate) Gaussian nature of this
case has been observed before by Dayhoff, see (Mount, 2001, Ch. 3) and references given
there; a more detailed investigation of tail probabilities can be found in Waterman (1994).
Still, it seems to be hardly noted, although it is a relevant phenomenon that deserves
further attention, with exact results presently not in sight.

For this reason, we could only investigate our findings numerically. Beyond checking
the Gaussian behaviour qualitatively, means and variances were calculated for different n,
both by exact enumeration (for $n \le 20$) and by simulation (for larger n, up to n = 1000).
It is an interesting question whether the mean and the variance, as functions of sequence
length, show power-law behaviour, at least asymptotically. Our data, see Figures 2 and 3,
are compatible with an asymptotically linear growth of the mean and an asymptotic $n^{2/3}$
power law for the variance, both with a square-root correction term (for which we do not
have any particular justification). Such predictions and conjectures are presently discussed
by various people (Matzinger, 2004). In particular, the $n^{2/3}$ power law for the variance
would be in line with analogous observations for the LCS, compare Hwa and Lässig (1996).
Since there has recently been some doubt in the correctness of this finding (Matzinger,
2004), it requires further corroboration and investigation.

Figure 3: Variance $V_L(n)$ of the probability distribution $P_L$ as a function of sequence length
n, calculated exactly for $n \le 20$ and by simulation otherwise. The solid line shows the
least squares fit $V_L(n) = -0.283\,\sqrt{n} + 0.498\,n^{2/3}$.
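As a quick plausibility check, the two least-squares fits can be evaluated at the sequence length used for Figure 1. The sketch below assumes that the fits in the captions of Figures 2 and 3 read as a square-root term plus a linear term for the mean, and a square-root term plus an n^(2/3) term for the variance; the function names `mean_fit` and `variance_fit` are introduced here for illustration.

```python
from math import sqrt


def mean_fit(n):
    """Fit of the mean, read from the caption of Figure 2."""
    return 0.413 * sqrt(n) + 0.283 * n


def variance_fit(n):
    """Fit of the variance, read from the caption of Figure 3."""
    return -0.283 * sqrt(n) + 0.498 * n ** (2 / 3)


for n in (100, 500, 1000):
    print(n, round(mean_fit(n), 2), round(variance_fit(n), 2))
```

For n = 500, these expressions give roughly 150.7 and 25.0, close to the mean 150.84 and the variance 24.82 quoted for the simulated distribution in Figure 1.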

A similar finding (though with larger fluctuations) applies to the distribution of the
values $d_D(u, v)$ for random pairs $u \neq v$. However, there is no compelling reason to investigate
this specific distance in detail, as it was mainly selected for illustrative purposes and
does not seem to be closely related to one of the standard problems of probability theory.

More interesting, and also more relevant, is the question of the joint distribution
of $d_D(u, v)$ and $d_L(u, v)$. A necessary requirement for a useful relation between the two
distances would be a strong correlation. However, as Figure 4 shows for sequences of length
100, there is little correlation at all: the joint distribution is rather well described by the
product of the two Gaussians needed for the marginal distributions. This observation could 
be quantified with some effort, but we refrain from doing so because it would not contribute 
to the interpretation at this stage. 
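One simple way to attach a number to such a comparison is the sample correlation coefficient of the two distances over random pairs. The sketch below is an illustration only and not an analysis from this note: the function names, the seed, and the small sizes (length 20, 300 pairs) are assumptions chosen so that it runs quickly, and a small correlation coefficient alone would not establish independence.

```python
import random
from math import sqrt


def levenshtein(u, v):
    """Edit distance with unit costs, O(len(u) * len(v)) time."""
    previous = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        current = [i]
        for j, b in enumerate(v, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (a != b)))
        previous = current
    return previous[-1]


def dictionary_distance(u, v):
    """Size of the symmetric difference of the full dictionaries (brute force)."""
    def dictionary(w):
        return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
    return len(dictionary(u) ^ dictionary(v))


def pearson(xs, ys):
    """Sample correlation coefficient; assumes both samples are non-constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def sample_distance_pairs(n, pairs, seed=0):
    """(d_L, d_D) for uniformly random pairs u != v of binary sequences of length n."""
    rng = random.Random(seed)
    result = []
    while len(result) < pairs:
        u = "".join(rng.choice("01") for _ in range(n))
        v = "".join(rng.choice("01") for _ in range(n))
        if u != v:
            result.append((levenshtein(u, v), dictionary_distance(u, v)))
    return result


sample = sample_distance_pairs(n=20, pairs=300)
d_l, d_d = zip(*sample)
print("sample correlation of d_L and d_D:", pearson(d_l, d_d))
```

A comparison of the joint histogram with the product of the two marginal histograms would be the more direct counterpart of the statement about Figure 4.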

Our finding means that, at least on the level of the full sequence space or for the 
alignment of two random sequences (as analyzed in our simulations), the distances $d_D(u, v)$
and $d_L(u, v)$ are closer to being statistically independent of each other than to being useful
approximations of one another. 

Figure 4: Numerical approximation of the joint probability distribution P for $d_D$ and
$d_L$, obtained from a simulation with $10^7$ random pairs of sequences of length 100. Both
marginal distributions are approximately Gaussian, with slightly larger fluctuations for the
distribution of $d_D$-values.

4 Concluding remarks 

Our findings are to be interpreted with care. They do not rule out a simplified approach 
to L-type distances, at least when restricted to (possibly relevant) subsets of sequences. 
However, they seem to indicate that subword comparison leads to statistically independent 
information, at least when viewed on the full sequence space. Clearly, different distance 
concepts can and should be tried. Moreover, a rigorous stochastic analysis of the various 
limit distributions is necessary to clarify the picture obtained from the simulations. 

As long as analytic results (e.g., via limit theorems) are unavailable, it would also help 
to perform a more detailed statistical analysis of the various distributions, including clear- 
cut statistical tests. In particular, it would be extremely relevant to also consider suitable 
subspaces of the full sequence space, such as those extractable from existing data bases. 
Though this is clearly far beyond the scope of this short note, we believe that it would be 
a rewarding task for future investigations. 